Term Deposit Predictor¶

by Jackson Lu, Daniel Yorke, Charlene Chin, and Mohammed Ibrahim, 2025/11/21

Summary¶

This project focuses on predicting whether clients will subscribe to a term deposit using the Bank Marketing dataset. A logistic regression model was developed, incorporating all available predictor variables after appropriate preprocessing. The model was evaluated using stratified 5-fold cross-validation, with an emphasis on the F1 score, which balances precision and recall. The analysis was conducted using Python and key libraries such as NumPy, pandas, and scikit-learn, with all code documented for reproducibility. Our final classifier performed fairly well on unseen data, achieving an accuracy of 0.844, an F1-score of 0.551, and a ROC-AUC of 0.910. This indicates that the model is reasonably effective at identifying clients who will subscribe to a term deposit, although there is room for improvement, particularly in recall. Further refinements could involve exploring additional features, tuning hyperparameters, or experimenting with alternative modeling techniques to enhance predictive performance.

Introduction¶

Financial institutions rely heavily on effective marketing strategies to identify which clients are most likely to subscribe to long-term financial products such as term deposits. These products support both customer financial planning and bank stability, yet subscription rates are often low due to ineffective targeting. Traditional marketing approaches depend heavily on human judgment, intuition, and repeated client contact, which can be costly, time-consuming, and inconsistent in effectiveness. As a result, developing more objective and data-driven methods for understanding and predicting client behaviour has become increasingly important.

In this project, we ask whether a machine learning algorithm can accurately predict whether a bank client will subscribe to a term deposit based on demographic attributes, financial information, and past marketing interactions. This question is important because traditional marketing strategies tend to rely on broad outreach rather than individualized prediction, leading to inefficiencies and potential client fatigue. Furthermore, understanding which client characteristics are associated with subscription behavior may support more personalized communication strategies and improve customer experience. If a machine learning classifier such as logistic regression can reliably predict subscription outcomes, it may enable more data-driven, scalable, and cost-effective marketing decisions, ultimately improving the performance of future campaigns.

Methods¶

Data¶

The dataset used in this project is the Bank Marketing dataset, created by Sérgio Moro, Paulo Cortez, and Paulo Rita in 2014 at the University of Minho in Portugal as part of a series of direct marketing campaigns conducted by a Portuguese banking institution. The data is publicly available through the UCI Machine Learning Repository and contains information on client demographics, financial status, and details related to previous marketing contacts.

The dataset contains 45,211 observations and 17 columns in total, comprising 16 predictor variables and 1 binary target variable (y) indicating whether the client subscribed to a term deposit. Each record represents a client who was contacted during a marketing campaign. The predictor variables capture a mix of demographic, financial, and campaign-related information. Among these, several features contain missing values (e.g., job, education, contact, and poutcome), requiring appropriate imputation or handling during preprocessing. Missing categorical values were imputed with a constant placeholder (“unknown”), and numerical features were standardized using StandardScaler to ensure comparability across variables. The target variable y is binary (yes or no), with only around 11–12% of the clients subscribing to a term deposit, resulting in a class imbalance that must be considered in model evaluation. Together, these attributes provide a rich and diverse feature set for assessing whether logistic regression can effectively capture the patterns associated with successful term-deposit subscriptions.
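The class imbalance noted above is straightforward to verify once the data are loaded; a minimal sketch, using a small stand-in frame since the real `df` is built later in this notebook (only the column name `y` comes from the dataset):

```python
import pandas as pd

# Hypothetical stand-in frame; in the notebook this check would run on the
# real `df` with its 45,211 rows.
df = pd.DataFrame({'y': ['no'] * 88 + ['yes'] * 12})

# Proportion of each target class; on the real data, 'yes' comes out
# around 0.11-0.12, the imbalance described above.
class_counts = df['y'].value_counts(normalize=True)
print(class_counts)
```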

This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

Analysis¶

A logistic regression classifier was developed to model the probability that a client would subscribe to a term deposit (y). All predictor variables from the original dataset were included after appropriate preprocessing, which involved encoding categorical features with OneHotEncoder and scaling numerical features using StandardScaler. The dataset was randomly divided into a training set (80%) and a test set (20%) to enable unbiased performance evaluation.

Prior exploratory analysis examined the distributions of all input variables in the training set, with plots colored by the binary outcome (“yes” or “no”). Most numerical predictors—such as previous, pdays, campaign, duration, age, and balance—displayed substantial overlap between the two classes. However, some features, particularly duration, showed clear differences: clients who subscribed tended to have significantly longer call durations. This observation is consistent with findings from the original dataset documentation, confirming duration as a strong predictor of subscription. Other variables, such as campaign, previous, and pdays, were highly right-skewed with long tails, while categorical variables (e.g., job, marital status, education, and contact type) appeared to carry complementary contextual information about clients. These exploratory patterns were visualized in Figure 1, which displays feature distributions by subscription status. Figure 2 presents the correlation matrix among numerical predictors.
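The duration pattern described above can be checked numerically with a simple group-by; a sketch on toy data (only the column names `duration` and `y` come from the real dataset, and in practice this would run on the training frame):

```python
import pandas as pd

# Toy stand-in for the training data: subscribers tend to have longer calls.
toy = pd.DataFrame({
    'duration': [120, 90, 300, 700, 150, 650, 80, 820],
    'y':        ['no', 'no', 'no', 'yes', 'no', 'yes', 'no', 'yes'],
})

# Median call duration per class; on the real data the 'yes' median is
# clearly higher, matching the separation visible in Figure 1.
medians = toy.groupby('y')['duration'].median()
print(medians)
```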

Correlation matrices (both Pearson and Spearman) were also examined to assess relationships among predictors. Overall, correlations between numerical features were weak, indicating low multicollinearity, which supports the use of logistic regression as an interpretable linear model. Some moderate associations were found among pdays, previous, and campaign, reflecting their shared connection to marketing contact history.
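The Pearson/Spearman comparison can be reproduced directly with pandas; a sketch on synthetic columns, where only the names `pdays`, `previous`, and `campaign` are taken from the real dataset:

```python
import numpy as np
import pandas as pd

# Synthetic columns that loosely mimic the contact-history features
# (right-skewed counts; pdays is -1 when there was no prior contact).
rng = np.random.default_rng(522)
n = 200
previous = rng.poisson(1, n)
num = pd.DataFrame({
    'previous': previous,
    'pdays': np.where(previous > 0, rng.integers(1, 400, n), -1),
    'campaign': rng.poisson(3, n) + 1,
})

# Pearson captures linear association; Spearman is rank-based and more
# robust to the heavy skew these features show.
pearson = num.corr(method='pearson')
spearman = num.corr(method='spearman')
print(pearson.round(2))
print(spearman.round(2))
```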

Model evaluation was conducted using stratified 5-fold cross-validation to address class imbalance. Performance was primarily assessed using the F1-score, which balances precision and recall, along with accuracy and ROC-AUC for comprehensive evaluation. Across the five folds, the model achieved a mean accuracy of 0.844, a mean F1-score of 0.551, and a mean ROC-AUC of 0.910. Training and test results were closely aligned, indicating minimal overfitting. These results suggest that the logistic regression model provides strong discriminatory ability, though recall could be improved by further class rebalancing or feature engineering.

All analysis was conducted in Python (Van Rossum & Drake, 2009) using NumPy (Harris et al., 2020), pandas (McKinney, 2010), scikit-learn (Pedregosa et al., 2011), and Altair for visualization. All code for data processing, modeling, and figure generation is documented within this notebook for reproducibility.

Results and Discussion¶

The results demonstrate that logistic regression can effectively distinguish clients likely to subscribe to a term deposit, achieving strong performance across multiple evaluation metrics. The identification of duration as the most influential predictor aligns with expectations—longer calls typically indicate higher engagement and interest in the product. The moderate F1-score, however, reflects difficulty in recalling all positive cases, which was anticipated due to the dataset’s pronounced class imbalance (only around 11–12% subscribed).

These findings highlight the model’s practical potential: banks could apply such a model to prioritize high-probability clients, improving campaign efficiency while reducing unnecessary contact costs. The high ROC-AUC value (0.91) suggests that even a simple, interpretable model can meaningfully support decision-making in marketing strategy.
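One way a bank might operationalize this prioritization, sketched on synthetic data rather than the bank dataset (the classifier and threshold of 50 clients are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced data stands in for the bank dataset.
X, y = make_classification(n_samples=500, weights=[0.88], random_state=522)
clf = LogisticRegression(max_iter=2000).fit(X, y)

# Rank clients by predicted subscription probability and keep the 50
# most promising ones for the next campaign wave.
proba = clf.predict_proba(X)[:, 1]
top_clients = np.argsort(proba)[::-1][:50]
print(proba[top_clients[:5]].round(3))
```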

Future work could explore whether non-linear models (e.g., tree-based or ensemble methods) further improve recall, or whether feature engineering on time-related or interaction variables enhances predictive performance. In addition, investigating the relative influence of demographic versus campaign-related features could deepen understanding of what drives client subscription behavior.
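A minimal sketch of the first suggestion, scoring a random forest with the same stratified F1 protocol (synthetic imbalanced data stands in for the real set; the point is the comparison recipe, not these particular numbers):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic imbalanced data as a stand-in for the bank dataset.
X, y = make_classification(n_samples=1000, weights=[0.88], random_state=522)

# A tree ensemble evaluated on F1, directly comparable to the logistic
# regression baseline reported above.
rf = RandomForestClassifier(n_estimators=200, class_weight='balanced',
                            random_state=522)
f1_scores = cross_val_score(rf, X, y, cv=5, scoring='f1')
print(f1_scores.mean().round(3))
```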

In [1]:
from sklearn.model_selection import (
    train_test_split, cross_val_score, cross_validate, StratifiedKFold
)
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
import pandas as pd
import numpy as np
import os
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
import altair as alt
import altair_ally as ally
from altair import datum

# Write Altair chart data to JSON files under data/altair/ instead of embedding it inline
alt.data_transformers.enable('json', prefix='data/altair/')
Out[1]:
DataTransformerRegistry.enable('json')
In [3]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
bank_marketing = fetch_ucirepo(id=222) 
  
# data (as pandas dataframes) 
X = bank_marketing.data.features 
y = bank_marketing.data.targets 
In [4]:
# Define the folder path
folder_path = './data/'
altair_path = './data/altair/'

# Ensure the directory exists (create it if it doesn't)
os.makedirs(folder_path, exist_ok=True)
os.makedirs(altair_path, exist_ok=True)

# Define file paths
features_file_path = os.path.join(folder_path, 'bank_marketing_features.csv')
targets_file_path = os.path.join(folder_path, 'bank_marketing_targets.csv')

# Export the DataFrames to CSV
X.to_csv(features_file_path, index=False) # index=False prevents pandas from writing row indices to the file
y.to_csv(targets_file_path, index=False)

df = pd.concat([X, y], axis=1)
In [5]:
# to ignore warning messages from python ally
warnings.filterwarnings(
    "ignore",
    message="You passed a `<class 'narwhals.stable.v1.DataFrame'>` to `is_pandas_dataframe`.",
    category=UserWarning,
    module="altair.utils.data"
)
In [6]:
df.head()
Out[6]:
age job marital education default balance housing loan contact day_of_week month duration campaign pdays previous poutcome y
0 58 management married tertiary no 2143 yes no NaN 5 may 261 1 -1 0 NaN no
1 44 technician single secondary no 29 yes no NaN 5 may 151 1 -1 0 NaN no
2 33 entrepreneur married secondary no 2 yes yes NaN 5 may 76 1 -1 0 NaN no
3 47 blue-collar married NaN no 1506 yes no NaN 5 may 92 1 -1 0 NaN no
4 33 NaN single NaN no 1 no no NaN 5 may 198 1 -1 0 NaN no
In [7]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 17 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   age          45211 non-null  int64 
 1   job          44923 non-null  object
 2   marital      45211 non-null  object
 3   education    43354 non-null  object
 4   default      45211 non-null  object
 5   balance      45211 non-null  int64 
 6   housing      45211 non-null  object
 7   loan         45211 non-null  object
 8   contact      32191 non-null  object
 9   day_of_week  45211 non-null  int64 
 10  month        45211 non-null  object
 11  duration     45211 non-null  int64 
 12  campaign     45211 non-null  int64 
 13  pdays        45211 non-null  int64 
 14  previous     45211 non-null  int64 
 15  poutcome     8252 non-null   object
 16  y            45211 non-null  object
dtypes: int64(7), object(10)
memory usage: 5.9+ MB
In [9]:
df.describe()
Out[9]:
age balance day_of_week duration campaign pdays previous
count 45211.000000 45211.000000 45211.000000 45211.000000 45211.000000 45211.000000 45211.000000
mean 40.936210 1362.272058 15.806419 258.163080 2.763841 40.197828 0.580323
std 10.618762 3044.765829 8.322476 257.527812 3.098021 100.128746 2.303441
min 18.000000 -8019.000000 1.000000 0.000000 1.000000 -1.000000 0.000000
25% 33.000000 72.000000 8.000000 103.000000 1.000000 -1.000000 0.000000
50% 39.000000 448.000000 16.000000 180.000000 2.000000 -1.000000 0.000000
75% 48.000000 1428.000000 21.000000 319.000000 3.000000 -1.000000 0.000000
max 95.000000 102127.000000 31.000000 4918.000000 63.000000 871.000000 275.000000

Here we show the distribution of each feature, colored by subscription status.

In [ ]:
ally.alt.data_transformers.enable('vegafusion')
ally.dist(df, color='y')
Out[ ]:

Figure 1. Key Feature Distributions

Here we are showing correlations between different features.

In [ ]:
ally.corr(df)
Out[ ]:

Figure 2. Feature Correlations

We created pipelines to carry out transformations on numerical and categorical features separately. The numerical features were standardized using StandardScaler, while the categorical features were encoded using OneHotEncoder. The final pipeline combined these preprocessing steps with the LogisticRegression model.

In [11]:
# Simple pipeline example
numeric_pipeline = make_pipeline(
    SimpleImputer(strategy='median'),
    StandardScaler()
)

categorical_pipeline = make_pipeline(
    SimpleImputer(strategy='constant', fill_value='unknown'),
    OneHotEncoder(drop='first')
)
In [12]:
# First, let's prepare the data
# Handle categorical variables in features
categorical_columns = X.select_dtypes(include=['object']).columns.tolist()
numerical_columns = X.select_dtypes(include=['int64', 'float64']).columns.tolist()

print(f"Categorical columns: {categorical_columns}")
print(f"Numerical columns: {numerical_columns}")
Categorical columns: ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome']
Numerical columns: ['age', 'balance', 'day_of_week', 'duration', 'campaign', 'pdays', 'previous']
In [13]:
# Create preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_pipeline, numerical_columns),
        ('cat', categorical_pipeline, categorical_columns)
    ])
In [14]:
full_pipeline = make_pipeline(
    preprocessor,
    LogisticRegression(random_state=522, max_iter=2000, class_weight="balanced")
)
In [16]:
# Prepare target variable
# LabelEncoder just creates a simple mapping - no statistics involved
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y.values.ravel())

# What it does:
# 'no'  → 0
# 'yes' → 1

print(f"Target classes: {label_encoder.classes_}")
# Note: ravel() converts the 2D DataFrame column of shape (45211, 1) into a 1D array of shape (45211,) so LabelEncoder can process it properly
Target classes: ['no' 'yes']
In [17]:
# Split the data
# 'stratify=y_encoded' ensures that your train and test sets have the same class distribution as your original dataset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y_encoded, test_size=0.2, random_state=522, stratify=y_encoded
)

print(f"Training set size: {X_train.shape}")
print(f"Test set size: {X_test.shape}")
Training set size: (36168, 16)
Test set size: (9043, 16)
In [ ]:
# Use stratified CV for imbalanced data
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=522)
cv_results = cross_validate(
    full_pipeline,
    X,
    y_encoded,
    cv=skf,  # ← Use stratified splits!
    scoring={'accuracy': 'accuracy', 'f1': 'f1', 'roc_auc': 'roc_auc'},
    return_train_score=True,
    n_jobs=-1
)
In [27]:
pd.DataFrame(cv_results).agg(['mean', 'std']).round(3).T
Out[27]:
mean std
fit_time 0.380 0.012
score_time 0.087 0.006
test_accuracy 0.844 0.004
train_accuracy 0.845 0.001
test_f1 0.551 0.008
train_f1 0.554 0.002
test_roc_auc 0.910 0.004
train_roc_auc 0.911 0.001

Table 1. Cross-validation performance metrics for logistic regression model
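The cross-validated numbers above can be given a final check on the held-out split; a self-contained sketch in which synthetic data stands in for the bank dataset (in this notebook, one would instead fit `full_pipeline` on `X_train`/`y_train` and score `X_test`/`y_test` the same way):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data as a stand-in for the bank dataset.
X, y = make_classification(n_samples=1000, weights=[0.88], random_state=522)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=522, stratify=y
)

# Fit the pipeline on the training split only, then score the held-out split.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=2000, class_weight='balanced')
)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print('accuracy:', round(accuracy_score(y_test, y_pred), 3))
print('f1      :', round(f1_score(y_test, y_pred), 3))
print('roc_auc :', round(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]), 3))
```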

Our prediction model performed quite well, with a mean cross-validated accuracy of 0.844 and F1-score of 0.551 on the held-out folds. The ROC-AUC of 0.910 indicates that the model is effective at distinguishing between clients who will and will not subscribe to a term deposit. However, there is room for improvement in identifying all potential subscribers, as some were missed by the model.

References¶

Moro, Sérgio, Paulo Cortez, and Paulo Rita. 2014. "A Data-Driven Approach to Predict the Success of Bank Telemarketing." Decision Support Systems 62: 22–31.

Harris, Charles R., et al. 2020. "Array Programming with NumPy." Nature 585: 357–362.

McKinney, Wes. 2010. "Data Structures for Statistical Computing in Python." Proceedings of the 9th Python in Science Conference, 56–61.

Pedregosa, Fabian, et al. 2011. "Scikit-learn: Machine Learning in Python." Journal of Machine Learning Research 12: 2825–2830.

Van Rossum, Guido, and Fred L. Drake. 2009. Python 3 Reference Manual. Scotts Valley, CA: CreateSpace.